Design Patterns for Checkpoint-Based Rollback Recovery
Abstract
Checkpoint-based rollback recovery is a widely used category of fault tolerance techniques built on a simple idea: save the system state during error-free portions of the execution and, when an error occurs, use the saved state to roll the system back to a recent consistent state. After an error, the system therefore does not have to restart its execution from the beginning, which would lengthen execution time or even cause the system to fail (e.g., when the I/O events that drove the execution are not reproducible). This paper presents three design patterns that capture the most widely used methods for checkpoint-based rollback recovery. The Independent Checkpoint pattern describes the method in which the constituent components of a system take checkpoints without synchronizing with each other. Synchronization takes place only after an error occurs, when a consistent system state must be re-established from the partial system states found in the checkpoints. The Coordinated Checkpoint pattern describes the method in which the constituent components of a system take checkpoints after synchronizing with each other. In this case, no synchronization is required when a consistent system state is re-established after an error. Finally, the Communication-Induced Checkpoint pattern describes the methods in which the synchronization of checkpointing is triggered by communication events. This method combines the benefits of the previous two in terms of the time and space overhead imposed on system execution.
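The basic idea behind all three patterns (save state while execution is error-free, restore it when an error occurs) can be illustrated with a minimal single-component sketch. It is not taken from the paper: the class `CheckpointedCounter`, the function `run`, the in-memory checkpoint, and the convention that a negative input simulates an error are all assumptions made for illustration, and the sketch does not address coordination between several components.

```python
import copy


class CheckpointedCounter:
    """Hypothetical component that keeps one in-memory checkpoint of its state."""

    def __init__(self):
        self.state = {"total": 0, "processed": 0}
        # The initial state serves as the first checkpoint.
        self._checkpoint = copy.deepcopy(self.state)

    def take_checkpoint(self):
        # Save a deep copy so later mutations cannot alter the checkpoint.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Discard the current (possibly corrupted) state.
        self.state = copy.deepcopy(self._checkpoint)

    def process(self, value):
        # The state is modified before the error is detected, so a failure
        # leaves the component partially updated.
        self.state["total"] += value
        if value < 0:
            raise ValueError("negative input simulates an error")
        self.state["processed"] += 1


def run(inputs, checkpoint_every=2):
    """Process inputs, checkpointing periodically and rolling back on error."""
    component = CheckpointedCounter()
    rollbacks = 0
    for i, value in enumerate(inputs, start=1):
        try:
            component.process(value)
        except ValueError:
            component.rollback()
            rollbacks += 1
            continue
        if i % checkpoint_every == 0:
            component.take_checkpoint()
    return component.state, rollbacks


if __name__ == "__main__":
    print(run([1, 2, 3, -1, 5]))
```

In this sketch, work done after the last checkpoint is lost on rollback: the third input is processed, then discarded when the fourth input fails. A shorter checkpoint interval reduces the lost work at the cost of more frequent state copies.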
Similar resources
Design Patterns for Log-Based Rollback Recovery
Log-based rollback recovery builds on the ideas of checkpoint-based rollback recovery and improves the characteristics of the recovery process. The basic idea captured by log-based rollback recovery techniques is an extension of the checkpoint idea: instead of relying solely on checkpoints to recover from an error, the system logs information about the non-determi...
Dependency-Aware Rollback and Checkpoint-Restart for Distributed Task-Based Runtimes
With the increase in compute nodes in large compute platforms, a proportional increase in node failures will follow. Many application-based checkpoint/restart (C/R) techniques have been proposed for MPI applications to target the reduced mean time between failures. However, rollback as part of the recovery remains a dominant cost even in highly optimised MPI applications employing C/R technique...
A Dynamic Checkpointing and Rollback Recovery Solution Based on Task Switching
Fault tolerance is an important issue in operating systems. Checkpointing and Rollback Recovery (CRR) is a key fault tolerance technique, and its simplicity and effectiveness have led to its wide use in handling operating system faults. CRR can be divided into checkpoint storage and checkpoint restoration, and checkpoint storage is a key factor in the real-time performance of checkpoint recovery. Current checkpoint storag...
Design and Analysis of an Efficient Energy Algorithm in Wireless Social Sensor Networks
Because mobile ad hoc networks have characteristics such as lack of center nodes, multi-hop routing and changeable topology, the existing checkpoint technologies for normal mobile networks cannot be applied well to mobile ad hoc networks. Considering the multi-frequency hierarchy structure of ad hoc networks, this paper proposes a hybrid checkpointing strategy which combines the techniques of s...
Dynamic Node Recovery in MANET for High Recovery Probability
One of the key design issues in ad hoc networks is the development of a rollback recovery model that provides fault tolerance in MANETs. Because energy is limited in a MANET, faults are more likely to occur. Hence, checkpointing is done at trusted nodes when faults are encountered, for successful rollback to the last saved state. This makes trust a vital factor to be de...